Abstract

In the past years, the folding kinetics of many small single-domain proteins has
been characterized by mutational Φ-value analysis. In this article, a simple,
essentially parameter-free model is introduced which derives folding routes from
native structures by minimizing the entropic loop-closure cost during folding. The
model predicts characteristic folding sequences of structural elements such as
helices and β-strand pairings. Based on a few simple rules, the kinetic impact of
these structural elements is estimated from the routes and compared to average
experimental Φ-values for the helices and strands of 15 small, well-characterized
proteins. The comparison leads on average to a correlation coefficient of 0.62 for
all proteins with polarized Φ-value distributions, and 0.74 if distributions with
negative average values are excluded. The diffuse Φ-value distributions of the
remaining proteins are reproduced correctly. The model shows that Φ-value
distributions, averaged over secondary structural elements, can often be traced
back to entropic loop-closure events, but also indicates energetic preferences in
the case of a few proteins governed by parallel folding processes.



1 Introduction 



Small single-domain proteins with fewer than 100 amino acids typically are
two-state folders. [1-3] These proteins fold from the denatured to the native
state without populating experimentally detectable intermediate states. [2] In
recent years, the folding kinetics of many two-state proteins has been
characterized by mutational Φ-value analysis. A Φ-value is a measure of the
impact of a mutation on the folding kinetics, defined as

$$\Phi = \frac{RT \ln(k'/k)}{\Delta G' - \Delta G} \qquad (1)$$
where k and ΔG are the folding rate and stability of the wildtype protein,
and k' and ΔG' are the corresponding quantities of the mutant. [2,4] For various
two-state proteins, detailed Φ-value distributions have been obtained by
considering many single-residue mutations throughout the protein chains. A
central question is why some proteins have polarized Φ-value distributions, while
others have diffuse distributions. In a polarized distribution, the Φ-values for
mutations in some of the secondary structural elements of the protein are
significantly larger than the values in other secondary elements. In a diffuse
distribution, the average Φ-values for the secondary elements of the protein
are rather similar.

Several results seem to indicate that the folding kinetics of two-state proteins is
dominated by their native-state topology. [5] Most importantly, the folding times
of two-state folders have been found to correlate with the relative contact order
(CO) of their native structures. [6-10] The relative CO of a protein is the average
contact order, or 'localness', |i − j| of all native contacts divided by the
chain length of the protein. Here, i and j indicate the sequence positions of
two residues in contact. The correlation holds for folding times over 6 orders
of magnitude, from microseconds for α-helical proteins with low relative CO
to seconds for β-sheet-containing proteins with high relative CO. Comparable
correlations with folding times have also been found for other measures of
native-state topology. [11-16]
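
As a minimal sketch of the definition above, the relative CO can be computed from
a list of native contacts. The function name and the representation of contacts as
pairs of sequence positions are illustrative assumptions and are not taken from the
cited studies.

```python
def relative_contact_order(contacts, chain_length):
    """Average sequence separation |i - j| of the contacts, divided by chain length.

    contacts: iterable of (i, j) sequence positions of residues in native contact
    chain_length: number of residues in the protein
    """
    contacts = list(contacts)
    if not contacts or chain_length <= 0:
        raise ValueError("need at least one contact and a positive chain length")
    mean_separation = sum(abs(i - j) for i, j in contacts) / len(contacts)
    return mean_separation / chain_length


# Hypothetical toy input: three contacts in a 20-residue chain.
toy_rco = relative_contact_order([(1, 5), (2, 10), (3, 6)], 20)
```

In this toy input the separations are 4, 8, and 3, so the average CO is 5 and the
relative CO is 0.25.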

However, reproducing detailed Φ-value distributions in theoretical models
based on native-state topology has proven difficult. Several theoretical
models derive folding routes or Φ-values from native structures. [17-34] Some
of these models assume that the amino acid residues can be in either of two
states, native-like folded or unfolded. [17-24] In this respect, the models are
similar to the Zimm-Bragg model for helix-coil transitions where residues can
either be in a helix or in a coil state, [35] or to Ising models where particles
can either have spin up or down. Other models use explicit chain
representations of the proteins and simplified Go-type potential energies which
impose the native structure by postulating favorable interaction energies only
between pairs of residues that are in contact in the native structure. [25-31] The
unfolding kinetics of proteins has also been considered in Molecular Dynamics
simulations with all-atom models. [36-41] Some of these models have been
used to calculate Φ-values either for a single protein or a small number of
proteins. [17-20,25,27,28,30,31] A systematic comparison for a set of 19 proteins
has been performed by Alm et al. with an Ising-like model. [22] For more than
half of the proteins, Alm et al. obtain correlation coefficients r from 0.41 to
0.88 between theoretical and experimental Φ-values, and for 14 of the 19
proteins, the theoretical Φ-values were better than random permutations of the
experimental values. Kameda [33] has considered a Gaussian chain model with
a Go-type interaction potential and obtains positive correlation coefficients r
between 0.12 and 0.65 for 7 out of 12 proteins. More recently, Garbuzynskiy et
al. [34] have reproduced the Φ-value distributions of 17 proteins with an average
correlation coefficient of 0.54.

The model presented here focuses on average Φ-values for the secondary
structural elements of a protein. The starting point of the model is the native
contact map. The native contact map of a protein is a matrix in which element
(i, j) equals 1 if the two residues i and j are in contact in the native structure,
and 0 otherwise. The contacts in the native contact map of a protein typically are
arranged in clusters. These contact clusters correspond to structural elements
such as α-helices and β-strand pairings.

In the first step, the model derives folding routes from native contact maps. In
this step, the model considers all sequences in which the contact clusters, or
structural elements, can be formed. The key assumption of the model is that
the dominant folding routes can be identified as those sequences of events
which minimize the loop-closure cost and, hence, the entropic barriers during
folding. The loop-closure cost of a folding sequence is simply defined as the
sum of loop lengths for forming the clusters along that sequence. These loop
lengths are estimated via the graph-theoretical concept of effective contact
order (ECO) [42,43] (see Fig. 1). The ECOs and, thus, the loop-closure cost for
forming nonlocal structural elements typically can be reduced by the previous
formation of other, more local structural elements.

In the second step, the model estimates the kinetic impact of contact clusters
and secondary structural elements from the folding routes. In the model, the
kinetic impact of a contact cluster depends on how often the cluster appears on
the folding routes to (other) nonlocal clusters, and on the ECOs of the cluster.
The kinetic impact derived from the folding routes is compared to average
experimental Φ-values for the secondary structural elements. To test the model
systematically, 15 proteins are considered which (i) are small in the sense that
they have fewer than 10 contact clusters, and (ii) are well-characterized
in the sense that Φ-values for at least 10 residue positions are available. The
comparison between kinetic impact and average Φ-values leads on average
to a correlation coefficient of 0.62 for all 12 proteins with polarized Φ-value
distributions (see Fig. 3), and to an average correlation coefficient of 0.74 if
three proteins with negative average Φ-values are excluded. The three proteins
have negative average Φ-values below -0.1 in one of the secondary elements,
which are difficult to interpret. [44,45] The remaining proteins have diffuse
Φ-value distributions with similar average values in the secondary elements. In
agreement with the experiments, the distribution of kinetic impact for these
proteins is also diffuse. The model thus shows that the polarized or diffuse
shapes of most averaged Φ-value distributions can be traced back to
native-state topologies.

The minimum-ECO routes defined here represent perhaps the simplest possible
topology-based modeling of protein folding routes. The prediction of these
routes requires the definition of contact maps and contact clusters, but no
parameter fitting since the routes are defined as minima of a loop-closure cost
function in the space of possible folding sequences. Why can such a simple
prediction of folding routes, in combination with a few rules for estimating the
kinetic impact of structural elements, reproduce central aspects of mutational
experiments? The reason seems to be that the barrier for protein folding is
entropic. Furthermore, the relevant entropy here should be loop-closure
entropy, since other entropic contributions like the entropy loss for side-chain
'freezing' in contact clusters, once the loop is closed, should be rather
independent of the specific route, i.e. of the sequences in which the clusters are
formed. Clearly, this simple modeling has its limitations. The model is
limited to average Φ-values for secondary elements, since the realistic modeling
of detailed Φ-value distributions also requires energetic characteristics of the
specific mutations. [46] Another limitation is that the model cannot address
folding rates. The modeling of folding rates requires an additional estimate
for the intrinsic, route-independent formation times of the contact clusters,
besides loop closure. In a previous related model, these intrinsic cluster
formation times have been estimated via the number of steps required for 'zipping
up', or 'propagating', a contact cluster after the initial loop-closure step. [47,48]
This previous model has five parameters, which were fitted to the folding rates
of 24 two-state folders, and considers a more complex set of partially formed
zipping states of the clusters.

The model presented here is purely topology-based in the sense that it does
not use any sequence-specific information. A central question is whether purely
topology-based models can account for the experimentally observed differences
in the folding kinetics between proteins with similar folds, or similar overall
fold topology. A famous example is given by proteins L and G, which both have a
central α-helix and two rather symmetric β-hairpins at the chain ends.
Intriguingly, the structural symmetry is 'broken' in the Φ-value distributions,
and in each of these proteins in a different way: protein L has the largest
Φ-values in the N-terminal hairpin, [50] and protein G in the C-terminal
hairpin. [51] In the model presented here, native-state topology is captured by the
topology of the native contact maps. Proteins L and G have very similar folds,
but nonetheless small differences in their contact maps. This leads to
different folding routes which reproduce the observed symmetry breaking for the
two proteins. Other groups have used sequence-specific interaction energies in
topology-based models to account for these differences between proteins L and
G. [28,30,31,33] The present model traces the symmetry breaking of proteins L and
G back to native-state topology, but suggests that sequence-specific energetic
contributions may affect the folding kinetics of Sso7d and CspB. According to
the model, these proteins are governed by parallel folding processes with
similar loop-closure cost. However, the experimental Φ-value distributions seem to
indicate that one of the parallel processes dominates the kinetics, presumably
due to specific energetic interactions which are not considered in the model.



2 The model 

2.1 Folding routes 

The starting point of the model is the native contact map. The native contact
map of a protein is a matrix in which the element (i, j) equals 1 if the two
residues i and j are in contact in the native structure, and 0 otherwise. Here,
two residues are taken to be in contact if the distance between their Cα or
Cβ atoms is less than 6 Å, and if they are not nearest or next-nearest
neighbors in the sequence. The native contacts are grouped into contact clusters
(for details, see Methods section). These contact clusters correspond to the
structural elements of the protein: helices, β-strand pairings, and tertiary
interactions of helices or β-sheets. The contact maps and contact clusters of
the 15 proteins considered here are shown in Fig. 2. The contact clusters of a
protein can be divided into local and nonlocal clusters. Local clusters contain
at least one local contact with small contact order CO = |i − j| < 10,
whereas nonlocal clusters do not contain any such local contacts.
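
The contact criterion and the local/nonlocal distinction can be sketched as
follows. This is only an illustration under stated assumptions: the function
names are hypothetical, coordinates are plain (x, y, z) tuples in Å, the
criterion is read as "Cα-Cα distance or Cβ-Cβ distance below the cutoff", and
residues without a Cβ atom (glycine) are passed as None. The grouping of contacts
into clusters described in the Methods section is not reproduced here.

```python
import math


def native_contacts(ca, cb, cutoff=6.0, min_separation=3):
    """Return contact pairs (i, j), i < j, for lists of C-alpha and C-beta coordinates.

    min_separation=3 excludes nearest and next-nearest neighbors in the sequence.
    """
    contacts = []
    for i in range(len(ca)):
        for j in range(i + min_separation, len(ca)):
            close_ca = math.dist(ca[i], ca[j]) < cutoff
            close_cb = (cb[i] is not None and cb[j] is not None
                        and math.dist(cb[i], cb[j]) < cutoff)
            if close_ca or close_cb:
                contacts.append((i, j))
    return contacts


def is_local_cluster(cluster, max_local_co=10):
    """A cluster is local if at least one of its contacts has CO = |i - j| < 10."""
    return any(abs(i - j) < max_local_co for i, j in cluster)


# Hypothetical six-residue hairpin-like arrangement (C-alpha atoms only).
toy_ca = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (7.6, 4, 0), (3.8, 4, 0), (0.0, 4, 0)]
toy_contacts = native_contacts(toy_ca, [None] * 6)
```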

In the model, folding routes are derived from the loop-closure dependencies
between the contact clusters. To determine the loop-closure relations, all possible
sequences are considered in which the clusters can be formed. For a nonlocal
cluster, the length of the loop which has to be closed to form the cluster
contacts depends on these sequences of cluster formation. In other words, it
depends on which other clusters have been formed previously. A simple
example with only four contact clusters is CI2 (see Fig. 2). The β₁β₄ cluster of
CI2 consists of nonlocal contacts between the two chain ends. Forming these
contacts from the fully unfolded state requires the closure of a relatively large
loop, and hence costs a large amount of loop-closure entropy. However,
forming one or several of the structural elements α, β₂β₃, or β₃β₄ prior to β₁β₄
brings the two chain ends into closer spatial proximity and reduces the length
of the loop which has to be closed to form β₁β₄.

The length of the loop which is closed to form a specific contact between two
residues is estimated here via the concept of effective contact order (ECO). [42,43]
The ECO of the contact is the number of steps along the shortest path between
these two residues. Each 'step' either is (i) a covalent bond between consecutive
residues in the chain, or (ii) a previously formed noncovalent contact (see
Fig. 1). In contrast, the contact order (CO) only takes into account steps
of type (i) and hence measures the sequence separation of the two residues.
Unlike the ECO, the CO is independent of the folding route, i.e. of the sequence
in which contacts are formed.
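
Since the ECO is a shortest-path length in a graph whose edges are chain bonds and
previously formed contacts, it can be obtained with a breadth-first search. The
sketch below is one straightforward way to do this; the function name and the
zero-based residue numbering are assumptions made for illustration.

```python
from collections import deque


def effective_contact_order(i, j, n_residues, formed_contacts=()):
    """Number of steps on the shortest path from residue i to residue j.

    Steps are covalent bonds (r, r + 1) or previously formed contacts.
    With no formed contacts, the result equals the contact order |i - j|.
    """
    neighbors = {r: set() for r in range(n_residues)}
    for r in range(n_residues - 1):          # steps of type (i): chain bonds
        neighbors[r].add(r + 1)
        neighbors[r + 1].add(r)
    for a, b in formed_contacts:             # steps of type (ii): formed contacts
        neighbors[a].add(b)
        neighbors[b].add(a)

    distance = {i: 0}
    queue = deque([i])
    while queue:
        r = queue.popleft()
        if r == j:
            return distance[r]
        for s in neighbors[r]:
            if s not in distance:
                distance[s] = distance[r] + 1
                queue.append(s)
    raise ValueError("residue j is not reachable from residue i")


# Hypothetical 20-residue chain: contact between the chain ends.
eco_unfolded = effective_contact_order(0, 19, 20)               # equals the CO
eco_after = effective_contact_order(0, 19, 20, [(2, 17)])       # 0-1-2-17-18-19
```

In this toy chain, a single previously formed contact between residues 2 and 17
reduces the loop length for the end-to-end contact from 19 to 5 steps.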

The key assumption of the model is that folding routes which involve only
closures of relatively small loops dominate the folding process. These routes
minimize the entropic loop-closure barriers during folding. To determine the
minimum-entropy-loss routes, all possible sequences of cluster formation are
considered. Sequences of cluster formation here are called folding sequences.
The formation of each contact cluster in a folding sequence requires closing
a loop. The length of this loop is estimated as the minimum ECO among all
cluster contacts, the cluster ECO. The cluster ECO thus is an estimate for
the length of the shortest loop that has to be closed to form the cluster in a
given partially folded conformation.¹ Suppose we have a sequence of clusters
$C_1 C_2 \cdots C_n$. Since no contacts have been formed prior to $C_1$, the ECO
$\ell_1$ of this cluster simply is the minimum CO among the cluster contacts. For
the other clusters $C_i$ in the folding sequence, the cluster ECO is the minimum
ECO among the cluster contacts, given the contacts of the previously formed
clusters $C_1, C_2, \ldots, C_{i-1}$. This leads to a sequence of cluster ECOs, or
loop lengths, $\ell_1, \ell_2, \ldots, \ell_n$.

For each folding sequence $C_1 C_2 \cdots C_n$, the total loop-closure cost can be
defined as $s = \sum_{i=1}^{n} f(\ell_i)$ where the $\ell_i$ are the cluster ECOs
along the sequence, and $f(\ell_i)$ is a weighting function which increases with
the loop length $\ell_i$. For simplicity, the linear weighting function
$f(\ell_i) = \ell_i$ is used here. [49] This linear approximation for the
free-energy cost of loop closure is not unreasonable since the range of relevant
ECOs here only spans roughly one order of magnitude, from 2 to 20 or 30 (see
Table 1). The total loop-closure cost then simply is the sum of ECOs
$s = \sum_{i=1}^{n} \ell_i$ for all clusters along the sequence.
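
The cluster ECOs and the loop-closure cost of a folding sequence can be sketched
as below, using the linear weighting. The toy chain, the three clusters, and all
function names are hypothetical and chosen only to illustrate how local clusters
formed earlier reduce the loop length of a nonlocal cluster.

```python
from collections import deque


def shortest_path(i, j, n_residues, formed):
    """Breadth-first search over chain bonds and previously formed contacts."""
    neighbors = {r: {r - 1, r + 1} & set(range(n_residues)) for r in range(n_residues)}
    for a, b in formed:
        neighbors[a].add(b)
        neighbors[b].add(a)
    distance, queue = {i: 0}, deque([i])
    while queue:
        r = queue.popleft()
        if r == j:
            return distance[r]
        for s in neighbors[r] - distance.keys():
            distance[s] = distance[r] + 1
            queue.append(s)
    raise ValueError("residues are not connected")


def cluster_ecos(folding_sequence, n_residues):
    """Cluster ECO of each cluster: minimum ECO among its contacts,
    given the contacts of all clusters formed earlier in the sequence."""
    formed, ecos = [], []
    for cluster in folding_sequence:
        ecos.append(min(shortest_path(i, j, n_residues, formed) for i, j in cluster))
        formed.extend(cluster)
    return ecos


def loop_closure_cost(folding_sequence, n_residues):
    """Total cost s as the sum of cluster ECOs (linear weighting)."""
    return sum(cluster_ecos(folding_sequence, n_residues))


# Hypothetical 20-residue chain with two local clusters and one nonlocal cluster.
C1 = [(4, 7), (3, 8), (2, 9)]
C2 = [(12, 15), (11, 16), (10, 17)]
C3 = [(1, 18), (0, 19)]
costs = {
    "C3": loop_closure_cost([C3], 20),
    "C1 C3": loop_closure_cost([C1, C3], 20),
    "C1 C2 C3": loop_closure_cost([C1, C2, C3], 20),
    "C2 C1 C3": loop_closure_cost([C2, C1, C3], 20),
}
```

In this toy example the cost drops from 17 for forming the nonlocal cluster
directly to 11 when both local clusters are formed first, and the two orderings of
the local clusters give the same cost.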

The minimum-ECO sequences to a given cluster $C_n$ are simply defined as local
minima of the loop-closure cost s in the space of all possible folding sequences
to $C_n$. In this space, the neighbors of a given folding sequence
$C_1 C_2 \cdots C_n$ are those sequences which are obtained either by deleting one
or several of the clusters from $C_1 C_2 \cdots C_n$, or by adding one or several
'new' clusters somewhere in the sequence (see also Methods section). In principle,
two neighboring folding sequences can have the same local minimum value of s. In
this case, the longer sequence among the two is selected as the minimum-ECO
sequence.

Finally, all minimum-ECO sequences which consist of the same set of clusters
are taken to represent the same minimum-ECO route. These sequences have
the same loop-closure cost s and differ only by permutations from each other,
which indicates parallel folding processes on the route. Suppose the ECO of the
nonlocal cluster $C_3$ is only affected by the two local clusters $C_1$ and $C_2$.
Since the ECOs of the local clusters $C_1$ and $C_2$ are independent of each other,
the two sequences $C_1 C_2 C_3$ and $C_2 C_1 C_3$ then both are minimum-ECO
sequences, representing the same minimum-ECO route. On this minimum-ECO route,
the two local clusters $C_1$ and $C_2$ form in parallel, prior to the nonlocal
cluster $C_3$.

¹ More precisely, the cluster ECO is an estimate for the length of the shortest loop
that has to be closed to 'initiate' the cluster, i.e. to form the cluster contact(s)
with minimum ECO. After 'initiation', the cluster is thought to be 'zipped up' in a
series of small-loop-closure steps. [47,48] These zipping steps do not depend on the
folding sequence. Therefore, they are not considered here.

Table 1 summarizes the loop-closure hierarchies on the minimum-ECO routes
for the proteins considered here. For each nonlocal cluster of a protein, all
clusters formed previously on the minimum-ECO route are shown. For some
nonlocal clusters, there are multiple minimum-ECO routes. These multiple routes
correspond to different local minima of the loop-closure cost s in the space
of folding sequences. However, local minima with a loop-closure cost s that
exceeds the global minimum by 10 or more are neglected. These local
minima represent folding routes with significantly larger entropic barriers.



2.2 Kinetic impact of secondary structural elements 

The most important kinetic data for two-state folders are Φ-values, which
reflect the impact of mutations on the folding kinetics (see eq. (1)). A
Φ-value distribution for a protein is obtained by considering many single-residue
mutations throughout the protein chain. For comparison with the model, the
experimental Φ-value distributions here are averaged over whole secondary
structural elements (helices or sheets). These average Φ-values typically are
positive and indicate the relative 'kinetic importance', or 'kinetic impact', of
the secondary structural elements. For example, a relatively large average
Φ-value for a secondary element indicates that mutations in this element have a
strong impact on the folding rate.

In order to compare with average experimental Φ-values, the kinetic impact
of secondary structural elements here is estimated from the loop-closure
hierarchies summarized in Table 1. For this purpose, we first have to consider
the kinetic impact of the contact clusters. In a semi-quantitative approach,
the kinetic impact of contact clusters and secondary structural elements here
is divided into high (H), medium (M), or low (L).

First, it seems reasonable to assume that the kinetic impact of a cluster should
be related to how often it appears on the minimum-ECO routes to other
clusters. Suppose a local cluster appears on minimum-ECO routes to all
nonlocal clusters. Mutations affecting the formation of this cluster then should
strongly affect the overall folding kinetics. Hence, the cluster has a high kinetic
impact. To quantify this notion, the occurrence number n of a cluster is defined
as the number of times it appears on all routes to all (other) nonlocal clusters.
In other words, n simply is the number of times the cluster occurs in the third
column of Table 1. In terms of occurrence numbers, the first rule is:

(1) The kinetic impact of a cluster is high (H) if its occurrence number n on
the minimum-ECO routes is larger than or equal to $\tfrac{2}{3} n_{\max}$. Here,
$n_{\max}$ is the maximum value of n among all clusters of the protein. The impact
of the cluster is medium (M) for $\tfrac{1}{3} n_{\max} \le n < \tfrac{2}{3} n_{\max}$.
The impact is low (L) for $n < \tfrac{1}{3} n_{\max}$.

Second, the kinetic impact of nonlocal clusters should also be affected by 
the cluster ECO. Suppose a nonlocal cluster has a high cluster ECO on all 
minimum-ECO routes. This means that forming the cluster always involves 
the closure of a relatively large loop. It seems reasonable to assume that the 
kinetic impact of the cluster then is high, since the contacts of this cluster
have to balance a relatively high loop-closure entropy. In other words, the
formation of the cluster and, hence, the overall folding kinetics should be 
highly sensitive to mutations affecting the cluster contacts. The second rule 
is: 

(2) A nonlocal cluster has a high (H) kinetic impact if the ECO of this cluster
is larger than 10 on all routes. The kinetic impact is medium (M) if the
smallest cluster ECO has a value from 6 to 10, unless rule (1) specifies
high impact.

According to the rules (1) and (2), the kinetic impact of a cluster thus is low 
if its occurrence number is small, and the cluster ECO is not larger than 5. 
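
Rules (1) and (2) can be combined in a small classification function, sketched
here under several assumptions: the function and parameter names are hypothetical,
the threshold fractions of rule (1) are passed as parameters with defaults of 2/3
and 1/3, and for nonlocal clusters the higher of the two ratings is taken. Rule (3),
which removes some occurrences before counting, is not included.

```python
from fractions import Fraction

LEVELS = ("L", "M", "H")


def cluster_impact(n, n_max, route_ecos=None,
                   hi_frac=Fraction(2, 3), lo_frac=Fraction(1, 3)):
    """Kinetic impact 'H', 'M', or 'L' of one contact cluster.

    n          : occurrence number of the cluster
    n_max      : maximum occurrence number among all clusters of the protein
    route_ecos : cluster ECOs on all minimum-ECO routes (nonlocal clusters only)
    hi_frac, lo_frac : assumed threshold fractions for rule (1)
    """
    rank = 0
    # Rule (1): rating from the occurrence number.
    if n_max > 0:
        if n >= hi_frac * n_max:
            rank = 2
        elif n >= lo_frac * n_max:
            rank = 1
    # Rule (2): rating from the cluster ECO of a nonlocal cluster.
    if route_ecos:
        smallest = min(route_ecos)
        if smallest > 10:            # ECO larger than 10 on all routes
            rank = 2
        elif smallest >= 6:          # smallest ECO from 6 to 10
            rank = max(rank, 1)
    return LEVELS[rank]


# Illustrative inputs with n_max = 2.
examples = [
    cluster_impact(2, 2),                 # local cluster on all routes
    cluster_impact(1, 2),                 # local cluster on half of the routes
    cluster_impact(0, 2, [10]),           # nonlocal cluster with ECO 10
    cluster_impact(0, 2, [5]),            # nonlocal cluster with ECO 5
]
```

Exact fractions are used for the thresholds so that borderline occurrence numbers
are not affected by floating-point rounding.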

Finally, suppose a protein has two nonlocal clusters $C_1$ and $C_2$ which fold
in parallel. This means that the cluster $C_1$ does not appear on the
minimum-ECO routes to $C_2$, and vice versa. In general, the loop-closure cost for
forming, e.g., $C_1$ can be significantly larger than the loop-closure cost for
forming $C_2$. It seems reasonable that clusters appearing on the minimum-ECO
routes to $C_1$ should then have a higher kinetic impact than clusters appearing
only on minimum-ECO routes to $C_2$, since the entropic loop-closure barrier for
forming $C_1$ is significantly larger. Therefore, the third rule is:

(3) If two nonlocal clusters $C_1$ and $C_2$ do not occur on minimum-ECO routes
to other clusters and have minimum loop-closure costs $s_1$ and $s_2$ with
$s_1 > s_2 + 5$, the cluster occurrences on the routes to $C_2$ are not taken into
account in rule (1). In particular, clusters which appear only on routes
to $C_2$ have a low kinetic impact, independent of their ECO.

The rules (1), (2), and (3) define the kinetic impact of clusters. The translation
into kinetic impact of secondary elements (strands or helices) is
straightforward. The kinetic impact of a secondary element is high (H) if it has
contacts in a cluster with high kinetic impact, and low (L) if it only has contacts
in clusters with low kinetic impact. The kinetic impact of a secondary element is
medium (M) if it has contacts in clusters with medium kinetic impact, but no
contacts in clusters with high kinetic impact. As an example, the high kinetic
impact of the clusters $\alpha_n$ and $\beta_k\beta_l$ of a protein results in a
high kinetic impact of the secondary elements $\alpha_n$, $\beta_k$, and $\beta_l$.
The relation between secondary elements and contact clusters is summarized in the
cluster labels of Fig. 2.

Table 2 shows average experimental Φ-values and kinetic impact for the strands
and helices of the 15 proteins considered here. To illustrate the rules (1) and
(2), consider for example the src SH3 domain. This protein has two nonlocal
clusters, RT-β₄ and β₁β₅ (see Fig. 2). The clusters β₂β₃ and β₃β₄ appear on
the minimum-ECO routes to both nonlocal clusters (see Table 1) and, hence,
have the occurrence number 2. The cluster RT only appears on the route to
β₁β₅ and, hence, has occurrence number 1. According to rule (1), the kinetic
impact of β₂β₃ and β₃β₄ thus is high (H), and the kinetic impact of RT is
medium (M). According to rule (2), the kinetic impact of the cluster RT-β₄ is
medium since it has the cluster ECO 10. Finally, the kinetic impact of β₁β₅
is low (L) since it has a small cluster ECO of 5 and occurrence number 0.
Therefore, the kinetic impact of the strands β₂, β₃, and β₄ is high, the kinetic
impact of RT is medium, and the kinetic impact of β₁ and β₅ is low, in perfect
agreement with the average Φ-values (see Table 2).
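
The translation from cluster impact to the impact of secondary elements amounts to
taking, for each element, the highest impact among the clusters in which it has
contacts. The sketch below applies this to the src SH3 clusters named in the text;
the ASCII labels (for example "b2b3" for the β₂β₃ cluster), the dictionary layout,
and the assignment of elements to clusters read off from the cluster labels are
assumptions made for illustration.

```python
RANK = {"L": 0, "M": 1, "H": 2}


def element_impact(cluster_impacts, cluster_elements):
    """Impact of each secondary element: highest impact of any of its clusters."""
    result = {}
    for cluster, impact in cluster_impacts.items():
        for element in cluster_elements[cluster]:
            current = result.get(element, "L")
            result[element] = max(current, impact, key=RANK.__getitem__)
    return result


# Cluster impacts of the src SH3 domain as stated in the text.
sh3_cluster_impacts = {
    "b2b3": "H", "b3b4": "H", "RT": "M", "RT-b4": "M", "b1b5": "L",
}
# Elements with contacts in each cluster, inferred from the cluster labels.
sh3_cluster_elements = {
    "b2b3": ["b2", "b3"],
    "b3b4": ["b3", "b4"],
    "RT": ["RT"],
    "RT-b4": ["RT", "b4"],
    "b1b5": ["b1", "b5"],
}
sh3_elements = element_impact(sh3_cluster_impacts, sh3_cluster_elements)
```

The strand β₄ illustrates the rule: it has contacts in a medium-impact cluster and
in a high-impact cluster, and therefore is rated high.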

Rule (3) affects the proteins U1A and L23. In the case of U1A, the cluster
α₁α₂ does not occur on the minimum-ECO routes to the two other nonlocal
clusters β₁β₃ and β₁β₄ and has a significantly smaller loop-closure cost than
β₁β₄. Therefore, α₂ has a low kinetic impact, since it only appears on the
minimum-ECO route to α₁α₂. In the case of L23, the nonlocal cluster t-α₂
folds in parallel to β₂β₄, with significantly smaller loop-closure cost. As
a consequence, the kinetic impact of β₁ and α₁ is low since these secondary
elements are only involved in the folding of t-α₂.



3 Results and discussion 

To evaluate the model, it is useful to distinguish between proteins with
polarized and diffuse Φ-value distributions. Here, this distinction is based on
the average Φ-values for the helices and strands. A distribution is polarized if
the Φ-values for some of the secondary elements are significantly larger than
for other secondary elements. To quantify this notion, a Φ-value distribution
here is defined as polarized if at least two average Φ-values are smaller than
the maximum value of the distribution by more than a factor of 2.5. A Φ-value
distribution is diffuse if this is not the case. In a diffuse distribution, all or all
except one of the average values are at least 40% of the maximum among
these values. An analogous definition can also be applied to the distribution
of kinetic impact derived from the minimum-ECO routes. The distribution
is diffuse if all or all except one of the secondary elements have high kinetic
impact.
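
The two definitions can be written as simple predicates. The sketch below is only
an illustration; the function names are hypothetical and the numerical inputs are
made-up values, not data from Table 2.

```python
def is_polarized(average_phi, factor=2.5):
    """True if at least two average values are smaller than the maximum
    by more than the given factor (here 2.5)."""
    top = max(average_phi)
    n_small = sum(1 for value in average_phi if value * factor < top)
    return n_small >= 2


def impact_is_diffuse(impacts):
    """True if all or all except one of the secondary elements have impact 'H'."""
    return sum(1 for impact in impacts if impact != "H") <= 1


# Made-up average values for four secondary elements.
polarized_example = is_polarized([0.6, 0.5, 0.1, 0.05])   # two values below 0.6 / 2.5
diffuse_example = is_polarized([0.5, 0.45, 0.3, 0.25])    # no value below 0.5 / 2.5
```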

According to this definition, 3 among the 15 proteins considered here have
a diffuse Φ-value distribution. These proteins are CI2, S6, and FNfn10. In
agreement with the experiments, the distribution of kinetic impact for the
secondary structural elements of these proteins is also diffuse (see Table 2).
The remaining 12 proteins have polarized Φ-value distributions. Fig. 3 shows
the correlation coefficient r between average Φ-values and kinetic impact for
each of these proteins. To calculate the correlation coefficients, the values 0,
1, and 2 are assigned to the kinetic impact L, M, and H.² The correlation
coefficient r can attain values in the range -1 to 1 where 1 means 'perfect'
correlation (proportionality), 0 means no correlation, and negative values mean
anticorrelation.
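
A minimal sketch of this comparison is given below, assuming that r is the usual
Pearson correlation coefficient. The function names and the numerical inputs are
hypothetical; negative average Φ-values are set to zero before the comparison, as
is done in this article.

```python
import math

DEFAULT_SCORES = {"L": 0, "M": 1, "H": 2}


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def impact_correlation(average_phi, impacts, scores=None):
    """Correlation between average values and kinetic impact (L, M, H)."""
    scores = DEFAULT_SCORES if scores is None else scores
    phi = [max(value, 0.0) for value in average_phi]   # negative averages -> 0
    return pearson_r(phi, [scores[impact] for impact in impacts])


# Made-up averages for five secondary elements and their kinetic impact.
phi_values = [0.45, 0.30, 0.10, 0.05, -0.15]
impacts = ["H", "H", "M", "L", "L"]
r_default = impact_correlation(phi_values, impacts)
r_shifted = impact_correlation(phi_values, impacts, {"L": 5, "M": 8, "H": 11})
```

The second call uses a different set of equidistant scores and gives the same
coefficient up to rounding, because r is invariant under positive linear rescaling
of either variable.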

Three of the lowest correlation coefficients are obtained for the α-spectrin SH3
domain, protein G, and ACBP (see Fig. 3). These proteins have clearly negative
average Φ-values (smaller than -0.1) in one of the secondary elements. For
the comparison with kinetic impact, the negative average Φ-values were simply
taken to be zero. However, excluding the helix α₂ of ACBP with negative
average Φ-value from the correlation analysis leads to a correlation coefficient
of 0.94, instead of 0.02. For the α-spectrin SH3 domain, excluding the strand
β₂ from the comparison leads to a correlation coefficient of 0.69 instead of
0.37. Thus, the relatively low correlation coefficients for these two proteins
can be traced back directly to the secondary elements with negative Φ-values.

Two other proteins with relatively low correlation coefficients in Fig. 3 are
Sso7d and CspB. These proteins have in common that the nonlocal clusters
fold in parallel on the minimum-ECO routes. In the case of CspB, the nonlocal
clusters are β₁β₄ and β₃β₅. Since the total loop-closure costs of the two parallel
folding processes leading to these clusters are similar (see Table 1), the model
takes them to be equally important for the kinetics. However, the experimental
Φ-values seem to indicate that the folding process leading to β₁β₄ has a larger
impact on the kinetics than the parallel process leading to β₃β₅. The strands
β₁ to β₃ of the two clusters β₁β₂ and β₂β₃ which are formed prior to β₁β₄ have
relatively large average Φ-values. In contrast, the strands of the cluster β₄β₅,
which is formed prior to β₃β₅ on the parallel folding process, have significantly
smaller average Φ-values. In the case of Sso7d, the three nonlocal clusters
α-β₃, β₁-β₅, and α-β₁ fold in parallel, with comparable loop-closure cost. Here,
the experimental Φ-values seem to indicate that the folding process leading
to α-β₃ dominates the folding kinetics. According to the model, the clusters
formed prior to α-β₃ are β₃β₄, β₃β₅, and α. The secondary elements of these
clusters have medium or large average Φ-values, whereas the Φ-values of
the remaining secondary structural elements β₁, β₂, and G₁ are significantly
smaller. In both proteins, specific energetic interactions, which are not taken
into account in the model, may be responsible for the dominance of one of the
parallel folding processes with similar entropic loop-closure barriers.

² Any other 'equidistant' values a, a+b, and a+2b with b > 0 for the kinetic impact
L, M, and H result in the same correlation coefficients. Correlation coefficients are
only given for proteins with polarized distributions since they do not reflect the
quality of the modeling in the case of diffuse distributions with rather similar
average Φ-values for the secondary elements.

For the remaining majority of proteins, the model reproduces the polarized
Φ-value distributions with relatively large correlation coefficients. This shows
that Φ-value distributions averaged over secondary elements are dominated
by native-state topology. In the model, the native-state topology is captured
by the topology of the native contact maps, or more precisely, by the
ECO-dependencies between the contact clusters. Interestingly, the model is able to
reproduce the experimentally observed differences in the Φ-value distributions
of proteins L and G without sequence-specific information. These two proteins
have very similar folds, but nonetheless small differences in their contact
maps. Whereas protein L has a small tertiary aPi cluster, protein G has 
a tertiary olPi cluster. This results in different folding routes and different 
distributions of kinetic impact (see Tables 1 and 2). In the case of protein L,
the N-terminal hairpin β₁β₂ has higher kinetic impact than the C-terminal hairpin
β₃β₄, in agreement with the average Φ-values. In the case of protein G, the
kinetic impact and average Φ-values are larger for the C-terminal hairpin β₃β₄.
Other groups have used sequence-specific interaction energies to reproduce
these differences between proteins L and G. [28,30,31,33]

The folding routes of the model are hierarchic in the sense that the formation 
of nonlocal structural elements typically requires the prior formation of other, 
more local structural elements. It is important to note that the hierarchic 
folding routes do not contradict cooperative two-state folding with characteristic
single-exponential relaxation dynamics. We have recently developed
a free-energy based model with similar loop-closure dependencies. 49 In this 
model, two-state folding cooperativity is reproduced when assuming that the 
local structural elements are unstable. On the one hand, the nonlocal structural
elements then stabilize the overall fold and, thus, also the local elements. On
the other hand, the local structural elements reduce the loop-closure entropies
for forming the nonlocal elements. 3 On the energy landscapes, the formation
of local structural elements then corresponds to uphill steps in free energy,
and the formation of nonlocal structural elements to downhill steps in free
energy, with characteristic barrier or 'transition' states in between. For an
α-helical protein, the hierarchy of local and nonlocal structural elements
corresponds to a hierarchy of secondary and tertiary elements, 55 since the local
structural elements are individual helices. However, this correspondence is not
general: a β-hairpin, for example, is a local structural element, but involves
both secondary and tertiary structure formation.

3 Similar in spirit, the diffusion-collision model of Karplus and Weaver assumes that
individual microdomains such as helices are unstable. 52,53 A direct, energetic
local-nonlocal coupling has been recently used by Kaya and Chan 54 to obtain two-state
cooperativity in a simple lattice model.



4 Conclusions 

The model presented here derives folding routes of proteins and the 'kinetic 
impact' of secondary structural elements from native structures. In a first 
step, minimum-entropy-loss routes are derived from the native contact maps. 
This step reveals characteristic loop-closure dependencies between local and 
nonlocal structural elements. In a second step, the model estimates the ki- 
netic impact of secondary elements from the folding routes. In a systematic 
comparison for a large set of small and well-characterized proteins, relatively 
high correlation coefficients are obtained between kinetic impact and average 
experimental Φ-values of the secondary elements. The model thus indicates
that the shape of Φ-value distributions is dominated by native-state topology.
6 Methods

Contact clusters 

The native contacts are grouped into contact clusters. In general, two contacts
(i, j) and (k, l) are taken to be in the same cluster if they are close together
on the contact map, according to the distance criterion |i - k| + |j - l| ≤ 4.
However, peripheral contacts (i, j) which have a minimum distance |i - k| +
|j - l| = 4 to the other contacts (k, l) in the cluster are discarded. For clusters
corresponding to helices or β-strand pairings, contacts (i, j) which have a
distance |i - k| + |j - l| = 3 to only one contact (k, l) in the cluster and larger
distances to the other contacts are also defined as peripheral and discarded. The
definition of peripheral contacts is more restrictive for these clusters since they
are typically more compact than clusters corresponding to tertiary interactions
of helices or sheets. A cluster has to contain at least three contacts. Isolated
contacts or contact pairs are not taken into account. Discarding peripheral and
isolated contacts helps to avoid an unreasonably large impact of individual
contacts on the cluster ECOs, and hence on the model results.
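
The grouping described above can be illustrated with a minimal sketch. It is not the original implementation and rests on assumptions: contacts are linked transitively (two contacts share a cluster if a chain of contacts with pairwise distance of at most 4 connects them), peripheral contacts are removed in a single pass, and the stricter distance-3 rule for helices and β-strand pairings is omitted. The function names `contact_distance` and `contact_clusters` are hypothetical.

```python
from itertools import combinations


def contact_distance(a, b):
    """Distance |i - k| + |j - l| between contacts a = (i, j) and b = (k, l)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def contact_clusters(contacts, cutoff=4, min_size=3):
    """Group contacts into clusters (assumed: transitive linkage, one pruning pass)."""
    contacts = sorted(set(contacts))
    parent = {c: c for c in contacts}

    def find(c):
        # Union-find root lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Link every pair of contacts that satisfies the distance criterion.
    for a, b in combinations(contacts, 2):
        if contact_distance(a, b) <= cutoff:
            parent[find(a)] = find(b)

    groups = {}
    for c in contacts:
        groups.setdefault(find(c), []).append(c)

    clusters = []
    for members in groups.values():
        # Discard peripheral contacts: their nearest cluster member is exactly
        # `cutoff` away. Single contacts are kept here and removed by min_size.
        core = [
            c for c in members
            if len(members) == 1
            or min(contact_distance(c, o) for o in members if o != c) < cutoff
        ]
        # Isolated contacts and contact pairs are not taken into account.
        if len(core) >= min_size:
            clusters.append(sorted(core))
    return sorted(clusters)


# Made-up example: five helix-like contacts, one peripheral, one isolated contact.
example = [(1, 5), (2, 6), (3, 7), (4, 8), (5, 9), (5, 13), (20, 40)]
clusters = contact_clusters(example)
```

In this made-up example, the contact (5, 13) has distance 4 to its nearest neighbour (5, 9) and is dropped as peripheral, while (20, 40) is isolated and forms no cluster.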

The following PDB files have been used to determine the contact maps and
contact clusters: CI2 (1COA); protein L (2PTL, residues 15 to 78); protein G
(1PGB); src SH3 domain (1SRL); α-spectrin SH3 domain (1SHG);
ADA2h (1AYE); U1A (1URN, chain A); S6 (1RIS); TNfn3 (1TEN); FNfn10
(1FNF, residues 1416 to 1509); Titin (1TIT); CspB (1CSP); L23 (1N88);
ACBP (2ABD).

Minimum-ECO sequences 

As defined in section 2, the minimum-ECO sequences to a given cluster C_n
locally minimize the loop-closure cost s in the space of folding sequences.
Starting with the set of all possible folding sequences to C_n, the
minimum-ECO sequences are obtained by applying the following two rules.

(1) If two folding sequences a and b have the loop-closure costs s_a < s_b
and the set of clusters {C_1^(a), C_2^(a), ..., C_n} of sequence a is a subset
of the clusters {C_1^(b), C_2^(b), ..., C_n} of sequence b, then sequence b is
discarded.

(2) If two folding sequences a and b have the loop-closure costs s_a > s_b and
the set of clusters {C_1^(a), C_2^(a), ..., C_n} of sequence a is a proper subset
of the clusters {C_1^(b), C_2^(b), ..., C_n} of sequence b, then sequence a is
discarded.

The rules (1) and (2) are best illustrated in a simple example. Suppose the
folding sequence C_1 C_2 C_3 to cluster C_3 has the loop-closure cost s_a. Suppose
now that the cluster C_0 is a cluster which does not affect any of the cluster
ECOs of C_1, C_2, or C_3. If the cluster C_0 is, e.g., a local cluster with small
cluster ECO, the cost s_b of the sequence C_0 C_1 C_2 C_3 is only slightly larger
than the cost s_a of the sequence C_1 C_2 C_3. However, since there is no
ECO-dependence between C_0 and the other three clusters, the sequence C_0 C_1 C_2 C_3
is not a reasonable candidate for a minimum-ECO sequence to cluster C_3, and
hence is discarded by rule (1).

On the other hand, suppose that the sequence C_1 C_3 has a larger cost s_b
than the sequence C_1 C_2 C_3. This means that the prior formation of C_2 affects
the ECO of C_3. Therefore, the sequence C_1 C_3 is discarded by rule (2) from
the possible minimum-ECO sequences to C_3.
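
The two rules can be written as a short filter over candidate sequences. The sketch below is an illustration under assumptions, not the original code: the loop-closure costs are taken as given input rather than computed, both rules are checked against the full candidate set at once, and the function name `minimum_eco_sequences` and the cost values are hypothetical.

```python
def minimum_eco_sequences(costs):
    """Apply rules (1) and (2) to candidate folding sequences.

    `costs` maps each folding sequence (a tuple of cluster names ending in the
    target cluster) to its loop-closure cost s. Returns the surviving sequences.
    """
    sequences = list(costs)
    discarded = set()
    for a in sequences:
        for b in sequences:
            if a == b:
                continue
            clusters_a, clusters_b = set(a), set(b)
            # Rule (1): a is cheaper and its clusters are a subset of those of b.
            if costs[a] < costs[b] and clusters_a <= clusters_b:
                discarded.add(b)
            # Rule (2): a is costlier and its clusters are a proper subset of b's.
            if costs[a] > costs[b] and clusters_a < clusters_b:
                discarded.add(a)
    return [s for s in sequences if s not in discarded]


# Made-up costs mirroring the example in the text.
example_costs = {
    ("C1", "C2", "C3"): 10.0,        # candidate route to C3
    ("C0", "C1", "C2", "C3"): 11.0,  # adds the unrelated local cluster C0
    ("C1", "C3"): 14.0,              # skips C2, so closing C3 costs more
}
survivors = minimum_eco_sequences(example_costs)
```

With these made-up costs, the sequence containing C0 is removed by rule (1) and the sequence that skips C2 by rule (2), leaving only C1 C2 C3.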



Secondary structure classification 



The calculation of average experimental Φ-values for helices and strands
requires secondary structure classifications. Where possible, the secondary
structure definitions given in the PDB files (see above) have been used here. The
PDB files of TNfn3 and the α-spectrin SH3 domain do not contain secondary
structure classifications. For TNfn3 and the structurally analogous protein
FNfn10, secondary structure classifications have been taken from Hamill et
al. 62 and Cota et al. 63 For the α-spectrin SH3 domain, the secondary structure
definition of the DSSP algorithm 68 has been used. In the case of CspB, the
first two substrands given in the PDB file are combined into strand β1, the
last two substrands into fin, and the 3_10 helix is defined from residues 30 to
33. The RT loop of the two SH3 domains is an irregular secondary structure 
defined here from residues 14 to 26. 
Table 1: Loop-closure events on minimum-ECO routes
Table 2: Average Φ-values and kinetic impact of secondary structural elements
Caption Table 2:

The average Φ-values have been calculated from data published in the following
articles: CI2, 56 protein L, 50 protein G, 51 src SH3, 57 α-spectrin SH3, 58
Sso7d, 20 ADA2h, 59 U1A, 60 S6, 61 TNfn3, 62 FNfn10, 63 Titin, 64 CspB, 65 L23, 66
ACBP. 67 The number in brackets behind an average Φ-value indicates the
number of residues in the secondary element for which Φ-values have been
measured. Averages taken from many Φ-values are more reliable. The kinetic
impact of the secondary elements is derived from the results shown in Table
1 and can attain the values low (L), medium (M), or high (H).
Fig. 1. The effective contact order (ECO) for the contact C2 is the length of the
shortest path between the two residues i and j forming the contact. The 'steps' in
this shortest-path problem are either covalent bonds between adjacent residues, or
noncovalent contacts formed previously in the folding process such as the contact
C1. In this example, the ECO for the contact C2 is 5, since the shortest path (shown
in red) involves two steps from i to k, one step for the contact C1 between k and l,
and two steps from l to j. The ECO is a measure for the length of the loop which
has to be closed to form the contact. In contrast, the contact order (CO) of C2 is
the sequence separation |i - j| between the two residues, the number of residues
along the blue path between i and j. In this example, the CO for the contact C2 is
10.
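
The shortest-path definition in the caption of Fig. 1 can be reproduced with a breadth-first search. This is a minimal sketch, assuming residues are numbered from 0 and that every covalent bond and every previously formed contact counts as one step. The function name `effective_contact_order` and the residue indices are hypothetical, chosen to match the step counts given in the caption.

```python
from collections import deque


def effective_contact_order(i, j, n_residues, formed_contacts=()):
    """Length of the shortest path from residue i to residue j.

    Steps are covalent bonds between adjacent residues or previously formed
    contacts (k, l). Residues are numbered 0 .. n_residues - 1.
    """
    neighbors = {r: set() for r in range(n_residues)}
    for r in range(n_residues - 1):
        # Covalent bond between adjacent residues.
        neighbors[r].add(r + 1)
        neighbors[r + 1].add(r)
    for k, l in formed_contacts:
        # Previously formed noncovalent contact acts as a shortcut.
        neighbors[k].add(l)
        neighbors[l].add(k)

    # Breadth-first search gives the shortest path in an unweighted graph.
    steps = {i: 0}
    queue = deque([i])
    while queue:
        r = queue.popleft()
        if r == j:
            return steps[r]
        for nb in neighbors[r]:
            if nb not in steps:
                steps[nb] = steps[r] + 1
                queue.append(nb)
    raise ValueError("residue j is not reachable from residue i")


# Indices matching the caption: two steps i -> k, contact C1 (k, l), two steps l -> j.
i, k, l, j = 0, 2, 8, 10
eco = effective_contact_order(i, j, n_residues=11, formed_contacts=[(k, l)])
co = effective_contact_order(i, j, n_residues=11)  # no contacts formed: equals |i - j|
```

With the contact C1 in place the search returns 5, and without it the path runs along the chain and returns the contact order 10.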
Fig. 2. Contact maps and contact clusters of the 15 proteins considered here.
Fig. 3. Correlation coefficients r for the comparison between average experimental
Φ-values and kinetic impact of the 12 proteins with polarized Φ-value distributions.
The light grey bars represent the correlation coefficients of proteins with negative
average Φ-values below -0.1 in one of the secondary elements. On average, the
correlation coefficient is 0.62 for all 12 proteins, and 0.74 for the 9 proteins with positive
Φ-values. For U1A, the correlation coefficients for the two Φ-value distributions at
β = 0.5 and β = 0.7 (see Table 2) are 0.91 and 0.79. Here, the average 0.85 of these
two values is presented. To test the statistical significance of the observed
correlations, one can compare the obtained correlation coefficient with those between the
theoretical distribution and all possible random permutations of the experimental
distribution for each of the proteins. The fraction p of random permutations of the
experimental data which have an equally high or larger correlation coefficient with
the theoretical distribution can be interpreted as the probability to obtain the
correlations shown in the figure, or larger ones, by chance. This probability is p = 0.017
for src SH3, p = 0.20 for α-spectrin SH3, p = 0.10 for Sso7d, p = 0.17 for ADA2h,
p = 0.033 and 0.067 for U1A, p = 0.033 for protein L, p = 0.30 for protein G,
p = 0.29 for TNfn3, p = 0.026 for Titin, p = 0.17 for CspB, and p = 0.036 for
L23. Despite the relatively small number of data points (the proteins have between
4 and 7 secondary structural elements), the obtained correlations are statistically
significant. The probability p for obtaining an average correlation coefficient of 0.62
or larger for all 12 proteins by chance is smaller than 10^-6.
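
The permutation test described in the caption of Fig. 3 can be sketched as follows. This is an illustration under assumptions: the kinetic impact levels L, M, and H are encoded as 0, 1, and 2 (a hypothetical encoding), the Pearson correlation coefficient is used, and the Φ-values in the example are made up rather than taken from Table 2. The function names are hypothetical.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(theory, experiment, tol=1e-12):
    """Fraction of permutations of `experiment` that correlate with `theory`
    at least as strongly as the observed ordering does."""
    r_observed = pearson(theory, experiment)
    perms = list(permutations(experiment))
    # `tol` guards against round-off when comparing equal correlation values.
    hits = sum(1 for p in perms if pearson(theory, p) >= r_observed - tol)
    return hits / len(perms)


# Made-up data: kinetic impact L, M, H encoded as 0, 1, 2 (assumed encoding).
kinetic_impact = [0, 1, 2]
average_phi = [0.1, 0.2, 0.3]
p = permutation_p_value(kinetic_impact, average_phi)
```

For three secondary elements there are only 6 permutations, so the smallest attainable p is 1/6. Proteins with 4 to 7 elements have 24 to 5040 permutations, which is why enumerating all of them is feasible here. The sketch assumes that neither input is constant, since the correlation coefficient is undefined in that case.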